UNHAS: Monitoring chlorophyll-a in Lake Tempe

Contents

Overview

Inland waterbodies are essential for supporting human life, both through the supply of drinking water and the support of agriculture and aquaculture. Such waterbodies can be contaminated by outbreaks of blue-green (and other toxic) algae, causing health issues for people and animals. For example, up to a million fish died during an algal bloom event in the Darling River in late 2018 and early 2019. While the health of waterbodies can be monitored from the ground through sampling, satellite imagery can complement this, potentially improving the detection of large algal bloom events. However, there needs to be a well-understood and tested way to link satellite observations to the presence of algal blooms.

Sentinel-2 use case

Algal blooms are associated with the presence of chlorophyll-a in waterbodies. Mishra and Mishra (2012) developed the normalised difference chlorophyll index (NDCI), which serves as a qualitative indicator for the concentration of chlorophyll-a on the surface of a waterbody. The index requires information from a specific part of the infrared spectrum, known as the 'red edge'. This is captured as part of Sentinel-2's spectral bands, making it possible to measure the NDCI with Sentinel-2.

On the other hand, the following caveats should also be kept in mind:

Description

In this example, we measure the NDCI for Lake Tempe, South Sulawesi, Indonesia. This is combined with information about the size of the waterbody, which is used to build a helpful visualisation of how the water level and the presence of chlorophyll-a change over time. The worked example takes users through the code required to:

  1. Interrogate the EASI / ODC database for products (satellites) and measurements (satellite bands and derived products)
  2. Use Dask for distributed processing of large datasets
  3. Load Sentinel-2 images for the area of interest and clean up the imagery (clouds, 'no-data' pixels, etc.)
  4. Use a shapefile to mask the dataset
  5. Compute indices to measure the presence of water and chlorophyll-a
  6. Generate informative visualisations to identify the presence of chlorophyll-a.

This notebook is adapted from a Digital Earth Australia example, and can be found in the following GitHub repository.

Notebook setup

Here, we load the key Python packages and supporting functions for the subsequent analysis.

The EASI Tools section comes from https://dev.azure.com/csiro-easi/easi-hub-public/_git/hub-notebooks. Git clone this repository to use these tools.

And let's now connect to the datacube database, which provides functionality for loading and displaying stored Earth observation data:

Dask

Rather than loading the Sentinel-2 data directly into system memory, this notebook makes use of Dask to load and process the dataset lazily.

Dask is a Python library for distributed (parallel) processing, which allows us to work on very large array or datacube objects that would otherwise not fit in the memory of any single compute node. Dask relies on the creation of a Dask Gateway (or local) cluster with associated scheduler, which is done in the following cell through the initialize_dask helper function.

The Dashboard link provided above (Client section) can be used to monitor the status of the Dask cluster and associated processing tasks during the execution of various cells in the rest of this notebook.
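As an illustration, a local Dask cluster can stand in for the cluster that the `initialize_dask` helper creates on EASI's Dask Gateway; the worker and thread counts below are purely illustrative.

```python
from dask.distributed import Client, LocalCluster

# Create a small in-process cluster and connect a client to its scheduler.
# On EASI, initialize_dask() would instead provision a Dask Gateway cluster.
cluster = LocalCluster(n_workers=2, threads_per_worker=2, processes=False)
client = Client(cluster)
```

The client object then exposes the dashboard URL (via `client.dashboard_link`) mentioned above for monitoring tasks.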

Region of interest

Analysis parameters

The following cell sets the analysis parameters, which define the area of interest and the length of time to conduct the analysis over.
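The parameter cell might look like the following sketch. The variable names and the coordinate values (an approximate bounding box around Lake Tempe) are assumptions for illustration only.

```python
# Approximate bounding box around Lake Tempe, South Sulawesi (assumed values).
latitude = (-4.20, -3.95)     # (south, north) bounds
longitude = (119.85, 120.10)  # (west, east) bounds
time_range = ("2021-01-01", "2021-12-31")

# Collect the parameters into a single query dictionary for later data loads.
query = {"latitude": latitude, "longitude": longitude, "time": time_range}
```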

Selected location display

The next cell will display the selected area on an interactive map. Feel free to zoom in and out to get a better understanding of the area you'll be analysing. Clicking on any point of the map will reveal the latitude and longitude coordinates of that point.

Sentinel-2 dataset

Interrogating the DC database

We can obtain a summary of the various data products available from the current ODC / EASI database, as follows.

Another way to investigate the available datasets is through the ODC Explorer interface for the current EASI deployment.

In this notebook, we will make use of the dataset of Sentinel-2 imagery, labelled s2_l2a. We can further interrogate the specific bands (measurements) available within this product:

Further below, we will mask the data to identify the pixels containing valid information (i.e. not no-data), as well as those pixels not affected by clouds, cloud shadow, etc. (i.e. clear pixels). Information regarding the quality of each pixel is provided in the SCL data layer (with mask or qa aliases), which can be investigated in more detail as follows.

Loading the data

The first step in the analysis is to load Sentinel-2 data for the specified area of interest and time range. We also specifically select various spectral bands to load: the red, green and blue bands will allow for true-colour displays of the Sentinel-2 data, while other bands will be used for the calculation of water-related indices further below. In addition, the pixel quality band SCL will be used to mask out the pixels affected by clouds, shadows, etc.

Sentinel-2 datasets are stored with different coordinate reference systems (CRS), corresponding to the multiple UTM zones used for Sentinel-2 tiling. The measurement bands also come at different resolutions (10 m, 20 m and 60 m). As such, S2 queries need to include the following two query parameters:

Use mostcommon_crs() to select a CRS. Adapted from https://github.com/GeoscienceAustralia/dea-notebooks/blob/develop/Tools/dea_tools/datahandling.py

Cleaning up the data

Next, we will mask out any clouds (and other pixel-quality issues) in the dataset, and also filter out any image where less than 70% of the pixels are classified as clear observations. Note that we include snow pixels in the set of "good" pixels: quite a few pixels in the time series are labelled as snow, but snow cover is highly improbable at the selected location. These pixels very likely represent clear observations that have been mis-classified, as can happen over highly reflective land covers such as salt lakes. Rather than discarding these likely mis-classified pixels, we therefore treat them as clear observations.

Finally, we can apply the resulting 'good pixel' mask to the data, as well as a further 'no-data' mask and further scaling operations.
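The logic can be sketched with plain NumPy on toy values (in the notebook the same operations run lazily on the Dask-backed dataset). The SCL codes follow the Sentinel-2 scene classification: 4 = vegetation, 5 = not vegetated, 6 = water, 7 = unclassified, 8/9 = cloud, 11 = snow/ice; as discussed, snow is kept in the "good" set.

```python
import numpy as np

# Toy SCL pixel-quality layer for one 2x2 scene.
scl = np.array([[4, 8],
                [6, 11]])
good_classes = [4, 5, 6, 7, 11]  # include snow (11) as 'good'
good_pixel_mask = np.isin(scl, good_classes)

# Keep the scene only if at least 70% of its pixels are clear.
clear_fraction = good_pixel_mask.mean()
keep_scene = clear_fraction >= 0.7

# Mask bad pixels with NaN and scale integer reflectances to the 0-1 range.
red = np.array([[1200.0, 3000.0],
                [800.0, 900.0]])
red_clean = np.where(good_pixel_mask, red, np.nan) / 10000.0
```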

Once the processing is complete, we can examine the data as done in the next cell. The Dimensions field reveals the number of time steps in the dataset, as well as the number of pixels in the x (longitude) and y (latitude) dimensions. Clicking on the data repr icon for each of the Data variables also confirms that the data is currently held as Dask chunks on the Dask cluster.

Visualise the data

We can visualise the data using a true-colour image display for a number of time steps. White areas indicate where clouds or other invalid pixels in the image have been masked.

Masking the raster data with a shapefile

Loading up the shapefile

For this example, we use a polygon of Lake Tempe's boundary, which is provided in the ancillary_data folder in this repository.

We can see here that the vector data within that shapefile is in the projection EPSG:4326, which is different from that of our main Sentinel-2 dataset (EPSG:32750). For compatibility, we can here re-project the shapefile data to the CRS of the Sentinel-2 dataset.

The shapefile we've just loaded contains a large number of polygons corresponding to the various jurisdictions in South Sulawesi. So let's now select only those polygons corresponding to Lake Tempe:

We can now produce a plot of the polygons of interest from the original shapefile.

Creating a raster mask

We can now create a raster mask from the vector data. The code below iterates over the polygons in the shapefile (in case multiple polygons are available), setting the raster mask values to 1 for all the pixels corresponding to the footprint of each polygon.
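The notebook likely relies on a rasterisation utility (e.g. `rasterio.features.rasterize`); as a self-contained illustration of the underlying idea, the sketch below rasterises a single polygon by testing each pixel centre with an even-odd (ray-casting) rule. The function name and grid are invented for this example.

```python
import numpy as np

def rasterize_polygon(poly, xs, ys):
    """Set mask to 1 for pixel centres falling inside `poly` (list of (x, y)
    vertices), using the even-odd ray-casting point-in-polygon test."""
    mask = np.zeros((len(ys), len(xs)), dtype=np.uint8)
    n = len(poly)
    for i, y in enumerate(ys):
        for j, x in enumerate(xs):
            inside = False
            for k in range(n):
                x1, y1 = poly[k]
                x2, y2 = poly[(k + 1) % n]
                # Count edge crossings of a horizontal ray to the right of (x, y).
                if (y1 > y) != (y2 > y):
                    x_int = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
                    if x < x_int:
                        inside = not inside
            mask[i, j] = 1 if inside else 0
    return mask

# A 4x4 square polygon rasterised onto a 6x6 grid of pixel centres.
square = [(0, 0), (4, 0), (4, 4), (0, 4)]
mask = rasterize_polygon(square, xs=np.arange(0.5, 6), ys=np.arange(0.5, 6))
```

Applying the resulting mask to the dataset is then a matter of `ds.where(mask)`, as done in the next section.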

Applying the mask to the data time series

Finally, we can use the mask we just created, apply it to the time series of Sentinel-2 data, and plot the result.

Band indices

Band arithmetic

This study measures the presence of water through the modified normalised difference water index (MNDWI), while the amount of chlorophyll-a is calculated using the normalised difference chlorophyll index (NDCI).

MNDWI is calculated from the green and shortwave infrared (SWIR) bands to identify water. The formula is

$$ \begin{aligned} \text{MNDWI} = \frac{\text{Green} - \text{SWIR}}{\text{Green} + \text{SWIR}}. \end{aligned} $$

When interpreting this index, values greater than 0 indicate water.

NDCI is calculated from the red edge 1 and red bands to estimate chlorophyll-a. The formula is

$$ \begin{aligned} \text{NDCI} = \frac{\text{Red edge 1} - \text{Red}}{\text{Red edge 1} + \text{Red}}. \end{aligned} $$

When interpreting this index, higher values indicate a higher concentration of chlorophyll-a.

Computation

As per the formulae provided above, we can now calculate the relevant indices from the various bands in our dataset. The code below achieves this, and subsequently saves the results back as new layers in the ds_s2 dataset.
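The band arithmetic itself is straightforward; the sketch below applies both formulae to synthetic reflectance values (in the notebook, the same expressions operate element-wise on the Dask-backed `ds_s2` bands).

```python
import numpy as np

# Synthetic reflectances for two pixels, standing in for Sentinel-2 bands.
green      = np.array([0.10, 0.08])
swir       = np.array([0.04, 0.20])
red        = np.array([0.05, 0.07])
red_edge_1 = np.array([0.09, 0.06])

# MNDWI > 0 indicates water; higher NDCI indicates more chlorophyll-a.
mndwi = (green - swir) / (green + swir)
ndci = (red_edge_1 - red) / (red_edge_1 + red)
```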

The MNDWI and NDCI values should now appear as data variables, along with the loaded measurements, in the ds_s2 data set. We can check this by printing the data set below:

Summary plot

To get an understanding of how the waterbody has changed over time, the following section builds a plot that uses the MNDWI to measure (roughly) the area of the waterbody, along with the NDCI to track how the concentration of chlorophyll-a is changing over time. This could be used to quickly assess the status of a given waterbody.

Analysis constants

If the pixel area is known, the number of pixels classified as water (i.e., where MNDWI > 0) can be used as a proxy for the area of the waterbody. The following cell generates the necessary constants for performing this conversion.
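For example, assuming the data were loaded on a 20 m grid, the constants reduce to:

```python
# Area of one pixel, assuming a 20 m x 20 m pixel grid.
pixel_size_m = 20
m2_per_pixel = pixel_size_m ** 2          # 400 m^2 per pixel
km2_per_pixel = m2_per_pixel / 1_000_000  # convert m^2 to km^2
```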

Total water area

The next cell starts by filtering the dataset to only keep the pixels that are classified as water. It then calculates the water area by counting all the MNDWI pixels in the filtered data set, calculating a rolling median – this helps smooth the results to account for variation from cloud-masking. This median count is then multiplied by the area-per-pixel value to obtain an estimate of the water area over time.
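The counting-and-smoothing step can be sketched with pandas on synthetic counts (the dip at the third time step mimics a cloud-affected scene; the area-per-pixel value of 0.0004 km² assumes a 20 m grid).

```python
import pandas as pd

# Synthetic per-time-step counts of water pixels (MNDWI > 0).
water_pixel_counts = pd.Series([5000, 5200, 100, 5100, 4900])

# Rolling median smooths out dips caused by cloud masking.
smoothed = water_pixel_counts.rolling(window=3, center=True, min_periods=1).median()

# Convert the smoothed pixel counts to an area estimate.
water_area_km2 = smoothed * 0.0004  # assumed km^2 per 20 m pixel
```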

Average NDCI

The next cell computes the average NDCI for each time step using the filtered (water-only) data set. This means that we're only tracking the NDCI within the waterbody, and not on land.

To make the summary plot, we average the NDCI across all water pixels at each time step; this allows us to track overall changes in NDCI, but doesn't tell us where within the waterbody any increase occurred (this is covered in the next section).
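The per-time-step averaging can be sketched with NumPy on synthetic values, where NaNs represent masked land pixels (in the notebook, the same reduction runs lazily on the Dask-backed dataset).

```python
import numpy as np

# NDCI values for one time step; NaN marks pixels outside the water mask.
ndci = np.array([[0.10, 0.30],
                 [np.nan, 0.20]])

# Average over water pixels only, ignoring the masked (NaN) land pixels.
mean_ndci = np.nanmean(ndci)
```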

Combined indices plot

The cell below combines the total water area and average NDCI time series we generated above into a single summary plot. Notice that Python can be used to build highly-customised plots such as the stripe plot below.

Spatial plots

NDCI plots

While the summary plot above is useful at a glance, it can be interesting to see the full spatial picture at times where the NDCI is especially low or high. In the cell below, we first identify the time indices corresponding to the minimum and maximum NDCI values in the time series. We can then use these time indices to extract and plot the data of interest.
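Picking out the extreme time steps can be sketched as follows (the per-time-step NDCI means here are synthetic; the notebook would use the series computed above):

```python
import numpy as np

# Synthetic mean-NDCI time series.
mean_ndci_per_step = np.array([0.05, 0.30, 0.12, 0.02])

# Indices of the time steps with the lowest and highest NDCI.
t_min = int(np.argmin(mean_ndci_per_step))
t_max = int(np.argmax(mean_ndci_per_step))
```

These indices can then be used with `.isel(time=...)` to extract the corresponding scenes for plotting.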

Alternatively, we could also manually select two specific dates to display. The code below defines two useful functions: closest_date will find the date in a list of dates closest to any given date, and date_index will return the position of a particular date in a list of dates. These functions are useful for selecting images to compare.
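A minimal sketch of these two helpers (using plain Python dates; the notebook versions would operate on the dataset's time coordinate):

```python
from datetime import date

def closest_date(dates, target):
    """Return the date in `dates` closest to `target`."""
    return min(dates, key=lambda d: abs(d - target))

def date_index(dates, d):
    """Return the position of date `d` in the sequence `dates`."""
    return list(dates).index(d)

# Example: pick the available date nearest to 20 February 2021.
dates = [date(2021, 1, 5), date(2021, 3, 2), date(2021, 6, 30)]
picked = closest_date(dates, date(2021, 2, 20))
idx = date_index(dates, picked)
```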

Run the cell below to define and plot two dates to compare.

True colour display

And finally, the cell below displays the true-colour images for the selected dates.